My project is centered on Data Science articles. These articles contain information on what is currently happening in our industry and how we talk about it. We primarily use theory to determine what trends are up and coming in our industry. I believe we can approach this from a more systematic perspective. We should perform a text analysis on Data Science articles published over the past 20 years so that we can visualize the most prominent topics and better understand how our industry has developed. Through this visualization we may be able to draw critical insights about where our industry is headed.
There appear to be significant changes over time in the main collections of words.
Add graphs of trends over time
## # A tibble: 10 x 7
##     ...1 row_num word            n exists total testdate2
##    <dbl>   <dbl> <chr>       <dbl> <lgl>  <dbl> <date>
##  1     2       1 computer        1 TRUE    1614 2021-06-10
##  2   112      16 data            1 TRUE    1303 2021-06-10
##  3   143      20 researchers     1 TRUE    1274 2021-06-10
##  4    53       9 ai              1 TRUE     905 2021-06-07
##  5    42       7 quantum         1 TRUE     788 2021-06-07
##  6   156      22 research        1 TRUE     776 2021-06-10
##  7   254      37 computing       1 TRUE     760 2021-06-07
##  8  1711     258 science         1 TRUE     753 2021-04-20
##  9    37       6 software        1 TRUE     725 2021-06-09
## 10    29       5 tech            1 TRUE     660 2021-06-13
Include link to finished dashboard. Dashboard is not finished at this time.
Background
Data Acquisition, Cleaning, and Exploration
Process
Dashboard
Data Science is a relatively new industry that has only recently begun to solidify. We can trace elements of Data Science far back into history, but only within the past few decades has computing technology advanced enough to make widespread statistical analyses feasible. Data Science as a field has shifted a lot over the past 20 years, specifically in the way we talk about it and in which concepts are considered important.
What is the problem? Currently, we base our thoughts on the “next big thing” in Data Science on theory. Theory in and of itself is not a bad place to start, but it’s not always right when it comes to predicting changes in the industry. We need to make sure that we are prepared as industry professionals to use and understand new technology as it emerges. Being behind the curve in Data Science can seriously hamper your capabilities.
The ultimate goal of this project is to generate an interactive dashboard using R-Shiny which allows the user to explore the most prevalent words throughout the sampled time period. Additionally, I hope to provide some general insights about how industry terminology has evolved over time on the default display of the app.
This should be expanded further
I intend to scrape text data from the abstracts of a newsletter related to Data Science over a period of approximately 20 years. I’ll start with some general text analysis on the article titles themselves. Then I can move into more complex methods and apply Principal Component Analysis (PCA) to aid in the analysis of each article’s abstract. I intend to generate a language consistency score which can then be visualized over time to show how industry language has shifted.
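To make the PCA step and the consistency score more concrete, here is a minimal sketch in base R. Everything in it is an assumption for illustration: the toy `abstracts` data frame stands in for the scraped abstracts, and the score shown (cosine similarity between one year's term frequencies and the previous year's) is only one possible definition, not a settled choice for this project.

```r
# Hypothetical toy corpus: one string of abstract text per year.
abstracts <- data.frame(
  year = c(2001, 2002, 2003, 2004),
  text = c("statistics database mining statistics",
           "statistics database mining warehouse",
           "machine learning data mining",
           "machine learning deep learning data"),
  stringsAsFactors = FALSE
)

# Build a year-by-term count matrix.
tokens_by_year <- strsplit(abstracts$text, " +")
vocab <- sort(unique(unlist(tokens_by_year)))
dtm <- t(vapply(tokens_by_year,
                function(tok) as.numeric(table(factor(tok, levels = vocab))),
                numeric(length(vocab))))
dimnames(dtm) <- list(as.character(abstracts$year), vocab)

# PCA on relative term frequencies (centered, not rescaled).
freq <- dtm / rowSums(dtm)
pca <- prcomp(freq, center = TRUE, scale. = FALSE)
pc_scores <- pca$x[, 1:2]  # first two components, one row per year

# Assumed score: cosine similarity between consecutive years.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
consistency <- vapply(2:nrow(freq),
                      function(i) cosine(freq[i, ], freq[i - 1, ]),
                      numeric(1))
names(consistency) <- as.character(abstracts$year[-1])

print(round(consistency, 2))
```

Under this definition, a low score for a given year would suggest that the vocabulary changed sharply compared with the year before, which is the kind of shift I would want to plot over time.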
Before I dive into any modeling, I’ll need to explore and thoroughly clean my dataset. Since I’m scraping the data from the internet, I will need to do an extensive amount of text processing before it is in a usable format. Once I have everything clean, it should be fairly simple to analyze it and generate the relevant graphs.
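As a rough illustration of the kind of text processing involved, the sketch below lowercases the text, strips leftover HTML entities and punctuation, splits it into words, and drops stop words. The sample titles, the short stop-word list, and the `clean_tokens()` helper are all hypothetical; the actual cleaning steps for this project may differ.

```r
# Hypothetical sample titles; the real input would be the scraped text.
titles <- c(
  "Quantum Computing: What's Next for AI?",
  "Researchers   release NEW data-science software &amp; tools"
)

# Small illustrative stop-word list (an assumption, not the project's list).
stop_words <- c("a", "an", "and", "for", "of", "the", "what", "s", "to", "in", "is")

clean_tokens <- function(text, stop_words) {
  text <- tolower(text)
  text <- gsub("&[a-z]+;", " ", text)   # drop HTML entities such as &amp;
  text <- gsub("[^a-z]+", " ", text)    # replace anything that is not a letter
  tokens <- strsplit(trimws(text), " +")[[1]]
  tokens[nchar(tokens) > 0 & !(tokens %in% stop_words)]
}

# Tokenize every title and count how often each word appears.
tokens <- unlist(lapply(titles, clean_tokens, stop_words = stop_words))
word_counts <- sort(table(tokens), decreasing = TRUE)
print(word_counts)
```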
Quick markdown code walk-through (cliff notes version)
Still need to condense code to minimal levels. May be valuable to simply describe the cleaning process instead.
Any major findings from exploration?
The scraper did not do a perfect job collecting all the article information. There are occasional instances where nearly a month of data is missing. Go into more detail on what data is missing. Potentially explore why it’s missing.
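One way to locate the missing stretches would be to sort the article dates and flag any unusually long interval between consecutive articles. The sketch below assumes a vector of article dates; the sample dates, the `find_gaps()` helper, and the 21-day threshold are illustrative choices only.

```r
# Hypothetical article dates; the real vector would come from the scraped data.
article_dates <- as.Date(c("2021-03-01", "2021-03-04", "2021-03-08",
                           "2021-04-20", "2021-04-22", "2021-06-07"))

# Return every interval between consecutive articles of at least min_gap_days.
find_gaps <- function(dates, min_gap_days = 21) {
  dates <- sort(unique(dates))
  gap_days <- as.numeric(diff(dates), units = "days")
  is_gap <- gap_days >= min_gap_days
  data.frame(
    gap_start = dates[-length(dates)][is_gap],  # last article before the gap
    gap_end   = dates[-1][is_gap],              # first article after the gap
    days      = gap_days[is_gap]
  )
}

gaps <- find_gaps(article_dates)
print(gaps)
```

Listing the gaps this way would also make it easier to check whether they line up with anything on the source site, such as a change in page layout.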
Markdown code walk-through of method application
May consider just describing method and providing intermediate visuals
Discuss initial findings. Still in-progress. Would be a good idea to show trends over time here
Display findings in polished graphical form
The cool graphs are still in development.
Use R-Shiny to dev interactive dashboard with various graphs. The Shiny app is still under development and I’m not sure exactly how to integrate it into R-Markdown just yet.
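As a starting point, a bare-bones Shiny app along these lines might look like the sketch below. It is not the finished dashboard: the `word_counts` table is made-up data, and the `top_words()` helper and the input and output IDs are placeholder names.

```r
library(shiny)

# Hypothetical word counts by year; the real table would come from the cleaned data.
word_counts <- data.frame(
  year = c(2019, 2019, 2020, 2020, 2021, 2021),
  word = c("data", "software", "data", "ai", "quantum", "ai"),
  n    = c(40, 25, 38, 30, 22, 45)
)

# Total the counts within a year range and keep the k most frequent words.
top_words <- function(counts, years, k = 10) {
  in_range <- counts[counts$year >= years[1] & counts$year <= years[2], ]
  totals <- aggregate(n ~ word, data = in_range, FUN = sum)
  head(totals[order(-totals$n), ], k)
}

ui <- fluidPage(
  titlePanel("Most prevalent words"),
  sliderInput("years", "Year range", min = 2019, max = 2021,
              value = c(2019, 2021), step = 1, sep = ""),
  plotOutput("top_plot")
)

server <- function(input, output, session) {
  output$top_plot <- renderPlot({
    top <- top_words(word_counts, input$years)
    barplot(top$n, names.arg = top$word, ylab = "Count")
  })
}

app <- shinyApp(ui, server)
# Launch interactively with: runApp(app)
```

For the R Markdown integration, one option I may explore is setting `runtime: shiny` in the document's YAML header, which lets Shiny inputs and outputs run inside the document itself.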
Brief discussion on the results, ideally the visuals will do most of the talking.
Nothing to discuss just yet. I’m still digging into the results.